Friday, May 20, 2011

Followup: ZFS Data Gone

In February I blogged about a nasty data loss event with ZFS:

http://christopher-technicalmusings.blogspot.com/2011/02/zfs-data-gone-data-gone-love-is-gone.html

I've been quite busy since then and haven't followed up on the results. Since a few people have been asking whether I was able to get the data back, here is my answer:

Yes, I did get most of it back, thanks to a lot of effort from George Wilson (great guy, and I'm very indebted to him). However, any data that was in play at the time of the fault was irreversibly damaged and couldn't be restored. Any data that wasn't active at the time of the crash was perfectly fine; it just needed to be copied out of the pool into a new pool. George had to mount my pool for me, as doing so was beyond my non-ZFS-programmer skills. Unfortunately, Solaris would dump after about 24 hours, requiring a second mounting by George. The pool was also slower than cold molasses to copy from in its faulted state; if I was getting 1 MB/sec, I was lucky. You can imagine the issue that creates when you're trying to evacuate a few TB of data through such a slow pipe: at 1 MB/sec, a single TB takes roughly 12 days.

After it dumped again, I didn't bother George for a third remounting (or rather, I tried only half-heartedly; he had already put a lot of time into this, and we all have our day jobs) and abandoned the data that was still stranded on the faulted pool. I had copied my most-wanted data first, so what I abandoned was a personal collection of movies that I could always re-rip.


I was still experimenting with ZFS at the time, so I wasn't using snapshots for backup, just conventional image backups of the VMs that were running. Snapshots would have had a good chance of protecting my data from the fault that I ran into.
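For illustration only, here's a minimal sketch of the kind of snapshot rotation I have in mind, in Python. The dataset name, snapshot prefix, and retention count are made up for the example; the point is just that periodic snapshots leave restore points for data that would otherwise only exist in its live, in-play form.

#!/usr/bin/env python
# Minimal rolling-snapshot sketch. DATASET, PREFIX, and KEEP are
# assumptions; adjust for your own pool layout and retention needs.
import datetime
import subprocess

DATASET = "tank/vmstore"   # hypothetical dataset holding the VM images
PREFIX = "auto"            # hypothetical snapshot name prefix
KEEP = 24                  # hypothetical retention: keep the newest 24 snapshots

def zfs(*args):
    # Run a zfs(8) subcommand and return its stdout.
    return subprocess.run(["zfs", *args], check=True,
                          capture_output=True, text=True).stdout

def take_snapshot():
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
    zfs("snapshot", "%s@%s-%s" % (DATASET, PREFIX, stamp))

def prune_snapshots():
    # List this dataset's snapshots oldest-first and destroy the surplus.
    names = zfs("list", "-H", "-t", "snapshot", "-o", "name",
                "-s", "creation", "-r", DATASET).splitlines()
    ours = [n for n in names if "@%s-" % PREFIX in n]
    for snap in ours[:-KEEP]:
        zfs("destroy", snap)

if __name__ == "__main__":
    take_snapshot()
    prune_snapshots()

Run something like that from cron every hour and you keep a day's worth of restore points to roll back to or clone from, independent of the image backups.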


I was originally blaming my Areca 1880 card, as I had been working with Areca tech support on a more stable driver for Solaris and was on the 3rd revision of a driver with them. In the end, though, it wasn't the Areca; I was very familiar with its tricks. The Areca would hang (about once every day or two), but it wouldn't take out the pool. After removing the Areca and going with just LSI 2008-based controllers, I had one final fault about 3 weeks later that corrupted another pool (luckily it was just a backup pool). At that point the swearing in the server room reached a peak, I booted back into FreeBSD, and I haven't looked back.

Originally, when I used the Areca controller with FreeBSD, I didn't have any problems with it during the two-month trial period.

I've had only small FreeBSD issues since then, and nothing else has changed in my hardware. So the only claim I can make is that in my environment, on my hardware, I've had better stability with FreeBSD than I did with Solaris.

Interesting note: one of the slow-downs with FreeBSD compared to Solaris in my tests was the O_SYNC behaviour that ESX requests when it mounts an NFS store. I edited the FreeBSD NFS server source to always do an async write, regardless of the O_SYNC from the client, and that perked FreeBSD up a lot for speed, making it fairly close to what I was getting on Solaris.
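To get a feel for why honouring every synchronous write is so expensive, here's a rough, purely illustrative micro-benchmark in Python comparing O_SYNC writes against buffered writes. The file path, block size, and write count are made up, and this is not the same thing as the NFS server change (which happens inside the kernel); it just shows the cost gap between the two write modes.

#!/usr/bin/env python
# Rough illustration of synchronous vs. buffered write cost.
# PATH, BLOCK, and COUNT are arbitrary; results depend heavily on the
# underlying storage (and, for ZFS, on whether there's a fast log device).
import os
import time

PATH = "/tank/synctest.bin"   # hypothetical file on the pool under test
BLOCK = b"x" * 4096           # small 4 KiB writes, similar to a chatty NFS client
COUNT = 1000

def timed_writes(extra_flags):
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags, 0o644)
    start = time.time()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
    os.close(fd)
    return time.time() - start

if __name__ == "__main__":
    sync_secs = timed_writes(os.O_SYNC)   # each write must reach stable storage
    buf_secs = timed_writes(0)            # writes land in cache, flushed later
    print("O_SYNC:   %d writes in %.2f seconds" % (COUNT, sync_secs))
    print("buffered: %d writes in %.2f seconds" % (COUNT, buf_secs))
    os.unlink(PATH)

On a pool without a fast separate log device, the O_SYNC run typically comes out dramatically slower, which gives a feel for why forcing async writes on the server side helps so much.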

I'm not sure why this makes such a difference, as I'm sure Solaris is also obeying the O_SYNC request. I do know that the NFSv3 code in FreeBSD is very old and a bit cluttered; it could just be issues there, and the async hack gives it back speed it loses in other areas.

FreeBSD now uses the 4.1 NFS server by default as of the last month, and I'm just starting stability tests with a new FreeBSD 9 build to see if I can run the newer code. I'll do speed tests again, and will probably make the same hack to the 4.1 NFS code to force async writes. I'll post an update when I get that far.

I do like Solaris. After some initial discomfort with the different way things were being done, I do see the overall design and idea, and I now have a wish list of features I'd like to see ported to FreeBSD. I think I'll set up a Solaris-based box again for testing. We'll see what time allows.

4 comments:

  1. Hello Christopher, I'm experiencing a ZFS data loss problem of my own right now and feel like I'm running out of ideas. Any way you can put me in touch with George? You can see the details of my woes here: http://www.nexentastor.org/boards/2/topics/8502

    If you get this and can help, please post on that thread.

    Replies
    1. This is a little late, but you can always track down George at Delphix. I'd rather not pass on an email address without permission.

      He also has an informative ZFS blog there, which I recommend reading.

      I've also been working on some Python tools for repairing ZFS pools; time will tell if I can get them to a publishable stage.

  2. Hello Christopher,
    Do you have this tool ready?

    Replies
    1. Hi Dans,

      Sorry, it's still on the back-burner, as more important items have come up. We are able to do rather advanced ZFS recoveries now, so the tools have evolved into something useful, but only for our in-house use.
